Abstract: This paper gives a theoretical analysis of high dimensional linear
discrimination of Gaussian data. We study the excess risk of linear
discriminant rules. We emphasize the poor performance of standard
procedures when the dimension p is larger than the sample size n. The
corresponding theoretical results are non-asymptotic lower bounds. On the
other hand, we propose two discrimination procedures based on dimensionality
reduction and provide associated rates of convergence which can be
$O(\log(p)/n)$ under sparsity assumptions. Finally, all our results rely on a
theorem that provides simple sharp relations between the excess risk and an
estimation error associated with the geometric parameters defining the
discrimination rule used.
1. Introduction

In the binary classification problem, the aim is to recover the unknown class
$y \in \{0, 1\}$ associated with an observation $x \in \mathcal{X} = \mathbb{R}^p$.
In other words, we seek a classification rule, also called a classifier: a
measurable function $g : \mathcal{X} \to \{0, 1\}$. This rule gives a wrong
classification for the observation $x \in \mathbb{R}^p$ if $g(x) \neq y$. The
underlying probabilistic model, which allows us to measure the performance of
a classification rule, is set by a distribution $P$ on $\mathcal{X} \times \{0, 1\}$
with conditional probability $P_k(\cdot) = P(\cdot \times \{k\})$ ($k = 0, 1$). In
this framework, under a uniform prior, the probability of misclassification is
defined by

\[ \mathcal{C}(g) = \frac{1}{2} \left( P_1(X \notin g^{-1}(1)) + P_0(X \notin g^{-1}(0)) \right). \]

In this paper we consider the case when $P_0$ and $P_1$ are Gaussian with means
$\mu_0$ and $\mu_1$ respectively and with common covariance $C$. Then, when
$\mathcal{X} = \mathbb{R}^p$, the Bayes rule, i.e. the classification rule $g^*$
that minimizes $\mathcal{C}(g)$, is given by

\[ g^*(x) = \begin{cases} 1 & \text{if } \langle F_{10}, x - s_{10} \rangle_{\mathbb{R}^p} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1) \]

where $F_{10} = C^{-}(\mu_1 - \mu_0)$, $s_{10} = \frac{\mu_1 + \mu_0}{2}$,
$C^{-}$ is the generalized inverse¹ of $C$ and $\langle \cdot, \cdot \rangle_{\mathbb{R}^p}$
is the Euclidean inner product of $\mathbb{R}^p$. Since $\mu_1$, $\mu_0$ and $C$
are unknown, $g^*$ is unknown. Assume that one observes two independent samples
$X^0 = (X^0_1, \dots, X^0_{n_0})$ and $X^1 = (X^1_1, \dots, X^1_{n_1})$ of
$\mathbb{R}^p$-valued i.i.d. observations with probability distribution $P_0$ or
$P_1$, respectively. One can use empirical rules $\hat g$ based on the
observations $X^0, X^1$ to mimic $g^*$. When one assumes that $P_1$ and $P_0$ are
Gaussian with the same covariance, it becomes natural to search for a
classification rule $\hat g : \mathbb{R}^p \to \{0, 1\}$ given by

\[ \hat g(x) = \begin{cases} 1 & \text{if } \langle \hat F_{10}, x - \hat s_{10} \rangle_{\mathbb{R}^p} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2) \]

where $\hat F_{10}, \hat s_{10} \in \mathbb{R}^p$ have to be estimated from the
observations $X^0, X^1$.
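As an illustration of the plug-in rule (2), here is a minimal Python sketch. It assumes a diagonal covariance, so that the generalized inverse acts coordinate by coordinate (coordinates with zero variance get weight 0), and it assumes the midpoint $s = (\mu_1 + \mu_0)/2$. The function names `oracle_parameters` and `linear_rule` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def oracle_parameters(
    mu0: Sequence[float], mu1: Sequence[float], var: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (F, s) for a diagonal covariance C = diag(var).

    F = C^-(mu1 - mu0) with the generalized inverse applied coordinate-wise,
    s = (mu1 + mu0) / 2 (assumed midpoint).
    """
    f = [(m1 - m0) / v if v > 0 else 0.0 for m0, m1, v in zip(mu0, mu1, var)]
    s = [(m1 + m0) / 2.0 for m0, m1 in zip(mu0, mu1)]
    return f, s


def linear_rule(x: Sequence[float], f: Sequence[float], s: Sequence[float]) -> int:
    """Return 1 if <f, x - s> > 0 and 0 otherwise."""
    score = sum(fi * (xi - si) for fi, xi, si in zip(f, x, s))
    return 1 if score > 0 else 0


f, s = oracle_parameters([0.0, 0.0], [2.0, 0.0], [1.0, 1.0])
label_near_mu1 = linear_rule([1.5, 3.0], f, s)
label_near_mu0 = linear_rule([0.5, -1.0], f, s)
```

In practice `f` and `s` would be replaced by estimates computed from the two samples, which is the estimation problem the rest of the paper studies.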

A standard way of assessing the quality of a decision rule $\hat g_n$ (where
$n = n_1 + n_0$) is to give an upper bound on $E[\mathcal{C}(\hat g_n) - \mathcal{C}(g^*)]$.
A classification rule $\hat g_n$ is said to be consistent if this last quantity
converges to zero when $n \to \infty$. In this paper, we are interested in the
case where $p \gg n$ ($p$ is the dimension of $\mathcal{X}$), and our aim is
twofold. First, we give two procedures that achieve the fast rate of
convergence. These procedures rely on a dimensionality reduction. Second, we
give lower bounds on the excess risk to show that standard procedures (such as
the Fisher discriminant analysis) fail in high dimension (when $p > n$). These
lower bounds are given as a function of the sample size $n$ and the dimension
$p$. They are not asymptotic lower bounds, since they remain valid in all cases
where $p > n$.

Let us introduce some notation that will be used throughout this paper. If $P$
is a probability measure on $\mathbb{R}^p$ with finite second order moment and
$u, v \in \mathbb{R}^p$, $\|v\|_{L_2(P)}$ will stand for the $L_2(P)$ norm² of
$x \in \mathbb{R}^p \mapsto \langle v, x \rangle_{\mathbb{R}^p}$, and
$\langle u, v \rangle_{L_2(P)}$ will stand for the associated scalar product.
This scalar product induces a geometry in $\mathbb{R}^p$; the associated angle
in $L_2(P)$ between $u$ and $v$ will be denoted by $\alpha_{L_2(P)}(u, v)$. In
the rest of the paper, $P_C$ will stand for a centered Gaussian measure with
covariance $C$.
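For a centered measure with covariance $C$, the $L_2(P_C)$ scalar product of the linear forms attached to $u$ and $v$ is $u^T C v$. The sketch below computes the norm, scalar product and angle in that geometry, assuming a diagonal covariance given as a list of variances; the helper names are hypothetical.

```python
import math
from typing import Sequence


def l2_inner(u: Sequence[float], v: Sequence[float], var: Sequence[float]) -> float:
    """<u, v>_{L2(P_C)} = u^T C v for C = diag(var)."""
    return sum(ui * vi * ci for ui, vi, ci in zip(u, v, var))


def l2_norm(v: Sequence[float], var: Sequence[float]) -> float:
    """Norm of x -> <v, x> under a centered measure with covariance diag(var)."""
    return math.sqrt(l2_inner(v, v, var))


def l2_angle(u: Sequence[float], v: Sequence[float], var: Sequence[float]) -> float:
    """Angle between u and v in the L2(P_C) geometry, in radians."""
    cos_a = l2_inner(u, v, var) / (l2_norm(u, var) * l2_norm(v, var))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.acos(max(-1.0, min(1.0, cos_a)))


variances = [4.0, 1.0]
norm_e1 = l2_norm([1.0, 0.0], variances)
angle_e1_e2 = l2_angle([1.0, 0.0], [0.0, 1.0], variances)
```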

Our main result in this paper is Theorem 3.1. There, we see that when

1. $F_{10}$ has a finite number of non-null components (sparsity assumption),

2. $\|F_{10}\|_{L_2(P_C)}$ is bounded from below (strict margin assumption),

then the procedure we propose achieves the rate $\log(p)/n$. Finally, our
theorem also shows an identical rate of convergence for other types of sparsity
and margin assumptions.



¹ If $C$ is a positive semi-definite matrix, one can define the associated
generalized inverse, also called the Moore-Penrose pseudo-inverse, $C^{-}$. This
generalized inverse arises from the decomposition
$\mathbb{R}^p = \mathrm{Ker}(C) \oplus \mathrm{Ker}(C)^{\perp}$. On $\mathrm{Ker}(C)$,
$C^{-}$ is null, and on $\mathrm{Ker}(C)^{\perp}$, $C^{-}$ equals the inverse of
$\tilde C = C|_{\mathrm{Ker}(C)^{\perp}}$ (the restriction of $C$ to $\mathrm{Ker}(C)^{\perp}$).

² Let us recall that the $L_2(P)$ norm of $f : x \in \mathbb{R}^p \mapsto f(x)$ is
defined by $\|f\|^2_{L_2(P)} = \int f(x)^2 \, P(dx)$.
There is a large body of literature about lower bounds on the excess risk in
the classification framework; see for example [19, 2, 18, 22, 21]. These
articles are mainly dedicated to the problem of finding the minimax rate of
convergence in certain classes of classifiers. These classes cannot be adapted
to our case. Moreover, we do not seek minimax lower bounds.

The classification rule we propose is a linear discriminant analysis with a
dimensionality reduction procedure. This type of discrimination procedure in a
high dimensional Gaussian framework has been investigated in [8, 15, 20, 4, 12]
and our work is in line with these papers. The main improvement we give is that
the full proposed procedures (including the use of a data dependent threshold)
come with a rate of convergence that can be the fast rate under a wide range of
sparsity assumptions. In our work we relate the classification error to the
error made while estimating $F_{10}$ and $s_{10}$, so our work is also related
to the area of plug-in classification. Our theoretical development is centered
on Theorem 5.1. There, we give a bound exhibiting a sharp relation (lower and
upper bound) between the estimation error of $F_{10}$ and the excess risk; to
our knowledge this relation has not been investigated before.

This paper is structured as follows. In Section 2 we give finite sample lower
bounds showing how poorly standard procedures for estimating $F_{10}$ perform
when $p \gg n$. In Section 3 we give two algorithms to overcome these problems,
together with associated theoretical results and numerical experiments. The
proofs, and the statement of Theorem 5.1, are postponed to Section 5.

2. Inconsistency of standard procedures when there are more
variables than observations

Within the learning set, we observe two independent samples
$X^0 = (X^0_1, \dots, X^0_{n_0})$, $X^1 = (X^1_1, \dots, X^1_{n_1})$ of
$\mathbb{R}^p$-valued i.i.d. observations with probability distribution $P_0$ or
$P_1$, respectively. The following proposition illustrates the inconsistency of
standard procedures when $p > n = n_1 + n_0$.

Proposition 2.1. For $k = 0, 1$, let $\hat C_k$ be the empirical covariance matrix of
$X^k$, and $\hat\mu_k$ be the empirical mean of $X^k$. Let us define



• If $\hat F_{10} = \hat C^{-}(\hat\mu_1 - \hat\mu_0)$, then the classification rule $\hat g$ defined by (2) leads




and let $\hat s_{10}$ be any estimator of $s_{10} \in \mathbb{R}^p$.

• If $\hat F_{10} = C^{-}\hat m_{10}$, then the classification rule $\hat g$ defined by (2) leads to




to 



Ep^4C{g)-C*]> 1 



Vn\\Fm\\L2(^P^) + 1 \ ||-Fio||l,(Pc) 



5||fl0lli^(f,^) 



Vp~~2 J 2V2^ 



e 



s 
General comments First, we note that $d = \|F_{10}\|_{L_2(P_C)}/2$ is related to the $L_1$
distance between $P_0$ and $P_1$ through this known equality:

di(Pi,Po) = J \dPi - dPo\ - - 

where $\Phi(x)$ is the cumulative distribution function of a real Gaussian random
variable with mean zero and variance one. Hence $d_1(P_1, P_0) \sim d$ when $d$ tends to
zero. In this case, the preceding lower bound is tight since
$\mathcal{C}(\hat g) - \mathcal{C}^* \le d_1(P_1, P_0)$.
When $d_1(P_1, P_0) \to 1$, $d \to \infty$ and

di(Pi,Po)^l-^. 

As a particular application of this proposition, we see that the Fisher rule is
not consistent when $p \gg n$, which was already shown in [4]. However, our result
is stronger: we can even say that if there exists $0 < c < 1$ such that $n/p < c$,
then the Fisher rule is not consistent.



Structural assumption. The preceding proposition suggests that in the problem
of estimating $F_{10}$ to construct a consistent rule $\hat g$ (as given by
Equation 2), when $p \gg n$, a structural assumption on $(C^{-})^{1/2}(\mu_1 - \mu_0)$
has to be made (by abuse of notation we will write $C^{-1/2} m_{10}$ in the
remainder of the paper). Indeed, from point 2 of the proposition, if there exist
$0 < r < R$ such that $R \ge \|F_{10}\|_{L_2(P_C)} \ge r$, then, uniformly over all
the possible values of $\mu_1$ and $\mu_0$, the excess risk can converge to zero
only if $p/n$ tends to 0. Recall that if no a priori assumption is made on
$m_{10}$, $\hat m_{10}$ is the best estimator of $m_{10}$ with respect to the
quadratic loss:
$\hat m_{10} = \arg\min_{f(X^1, X^0)} E\left[ \| m_{10} - f(X^1, X^0) \|^2_{\mathbb{R}^p} \right]$.

In the literature on high dimensional classification, the mean difference
vector $m_{10} = \mu_1 - \mu_0$ is commonly believed to be sparse (see [12]). In
this paper $C^{-1/2} m_{10}$ is assumed to be sparse. Intuitively, the sparsity
assumption makes it possible to bound the dimension of the subspace of
$\mathbb{R}^p$ in which the classification can be performed efficiently, and it
is sufficient but not necessary to relate this space to the sparsity of $m_{10}$
only. Indeed, there can be a direction $e \in \mathbb{R}^p$ such that
$e = \arg\max_{\|e_i\| = 1} \langle m_{10}, e_i \rangle$ but
$e = \arg\min_{\|e_i\| = 1} \langle C^{-1/2} m_{10}, e_i \rangle$, and it is natural
to take into account the overall dispersion of the data as well as the mean
difference vector.

Theoretically, the choice of a sparsity assumption on $C^{-1/2} m_{10}$ is
clarified by Theorem 5.1. Indeed, this theorem implies that if
$C_{-} \le \|F_{10}\|_{L_2(P_C)} \le C_{+}$ (for $C_{-}, C_{+} > 0$), there exist
$0 < c_1 < c_2$ such that

\[ c_1 \alpha^2 \le \mathcal{C}(\hat g) - \mathcal{C}^* \le c_2 \alpha^2 + f(\hat s_{10}) \]

where $\alpha = \alpha_{L_2(P_C)}(\hat F_{10}, F_{10})$ is the angle between
$\hat F_{10}$ and $F_{10}$ in the geometry of $L_2(P_C)$ and
$f : \mathbb{R}^p \to \mathbb{R}^{+}$ with $f(s_{10}) = 0$. This explains why an
assumption on the sparsity of $F_{10}$ in $L_2(P_C)$ (or a sparsity of
$C^{-1/2} m_{10}$) is more suitable.

The structural assumption on $C^{-1/2} m_{10}$ can be a consequence of structural
assumptions on $\mu_1 - \mu_0$ and on $C$. Much work, based on model selection or
aggregation, has already been done to define proper structural assumptions for
the estimation of $C$; see for example [5] and the references therein. Those
works are dedicated to the problem of estimating $C$ with a Hilbert-Schmidt
error measure, and do not give results in the classification framework. In
addition, we will see in the next section that it is not necessary to estimate
all the parameters of $C$: one only needs to estimate $F_{10}$, which has only
$p$ parameters. If a structural assumption is made on $C$, it has to be linked
with a statistical assumption. For example, reducing the number of parameters to
estimate can be done with a stationarity (or quasi-stationarity, as in [17])
assumption. If $C$ is Toeplitz (i.e. $C_{ij} = c(i - j)$ with
$c : \mathbb{Z} \to \mathbb{R}$ a $p$-periodic sequence), it is a circular
convolution operator, which is known to be diagonal in the discrete Fourier
basis $(g^m)_{0 \le m < p}$ defined by:

\[ (g^m)_k = \frac{1}{\sqrt{p}} \exp\left( \frac{2 i \pi m k}{p} \right). \]

A generalization (to the infinite dimensional framework) of this harmonic
analysis result is used in Bickel and Levina [4] and combined with
approximation in [17]. Using this type of assumption, the covariance matrix can
be searched for in the set of diagonal matrices. Let us note that the use of
harmonic analysis and stationarity in curve classification can become a wide
field of interest as soon as one considers the larger class of group stationary
processes (see [23]) or semi-group stationary processes (see [14]).
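The diagonalization claim can be checked numerically. The sketch below builds a circulant matrix from one period of a sequence $c$ and verifies that each discrete Fourier vector is an eigenvector, with eigenvalue $\sum_k c[k] \exp(-2 i \pi m k / p)$. The sign convention of the Fourier vectors and the function names are assumptions made for this illustration.

```python
import cmath
from typing import List, Sequence


def circulant(c: Sequence[float]) -> List[List[float]]:
    """Matrix with entries C[i][j] = c[(i - j) mod p], built from one period of c."""
    p = len(c)
    return [[c[(i - j) % p] for j in range(p)] for i in range(p)]


def fourier_vector(m: int, p: int) -> List[complex]:
    """m-th discrete Fourier basis vector of length p (unit Euclidean norm)."""
    return [cmath.exp(2j * cmath.pi * m * k / p) / p ** 0.5 for k in range(p)]


def eigen_residual(c: Sequence[float], m: int) -> float:
    """Largest entry of |C g - lambda g| for the m-th Fourier vector g."""
    p = len(c)
    mat = circulant(c)
    g = fourier_vector(m, p)
    lam = sum(c[k] * cmath.exp(-2j * cmath.pi * m * k / p) for k in range(p))
    cg = [sum(mat[i][j] * g[j] for j in range(p)) for i in range(p)]
    return max(abs(cg[i] - lam * g[i]) for i in range(p))


# A symmetric period, so that the circulant matrix is a valid covariance shape.
period = [2.0, 0.5, 0.1, 0.5]
residuals = [eigen_residual(period, m) for m in range(len(period))]
```

All residuals are zero up to rounding, which is why estimating such a covariance reduces to estimating a diagonal matrix in the Fourier basis.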

However, we believe that making a structural assumption directly on
$C^{-1/2} m_{10}$ is more suitable in the case of our classification problem. In
the problem of estimating a high dimensional vector, finding suitable
structural assumptions has been studied extensively (see for example [6]). In
this paper, we limit our work to $l^q$ bodies for $0 < q < 2$. Let $P_C$ be a
Gaussian measure on $\mathbb{R}^p$ with full rank covariance; for $0 < q < 2$ let
us define $l^q(R, P_C)$, the $l^q$ ball of $L_2(P_C)$ with radius $R > 0$, by

\[ l^q(R, P_C) = \left\{ v \in \mathbb{R}^p : \| C^{1/2} v \|_q^q \le R^q \right\}, \]

where $\|x\|_q^q = \sum_{i=1}^{p} |x[i]|^q$ for $x \in \mathbb{R}^p$. For a well
chosen orthonormal basis of $\mathbb{R}^p$, knowing that $F_{10} \in l^q(R, P_C)$
for $0 < q < 2$ will be used (see next Section) to construct a consistent
estimator of $F_{10}$.
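To make the sparsity class concrete, the following sketch tests membership in the $l^q$ ball for a diagonal covariance, where $C^{1/2} v$ is obtained by scaling each coordinate by the standard deviation. The function names are hypothetical and the diagonal restriction is an assumption of this illustration.

```python
from typing import Sequence


def lq_power_sum(x: Sequence[float], q: float) -> float:
    """Return sum_i |x[i]|^q, i.e. the q-th power of the l^q (quasi-)norm."""
    return sum(abs(xi) ** q for xi in x)


def in_lq_ball(v: Sequence[float], var: Sequence[float], q: float, radius: float) -> bool:
    """Check whether v lies in l^q(R, P_C) for C = diag(var) and 0 < q < 2."""
    if not 0 < q < 2:
        raise ValueError("q must satisfy 0 < q < 2")
    scaled = [ci ** 0.5 * vi for ci, vi in zip(var, v)]  # C^{1/2} v
    return lq_power_sum(scaled, q) <= radius ** q


sparse_inside = in_lq_ball([1.0, 0.0, 0.0], [4.0, 1.0, 1.0], q=1.0, radius=2.0)
dense_outside = in_lq_ball([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], q=0.5, radius=2.0)
```

Small values of $q$ penalize dense vectors heavily: the dense vector above has power sum 3 while the ball only allows $2^{0.5}$.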

3. Fast rate of convergence for linear discrimination rule 

In this section we suppose that $C$ is diagonal, and use the notation
$\sigma^2[i] = C[i, i]$. The learning set $(X_j^k)_{k=0,1;\ j=1,\dots,n_k}$ is
separated into two parts, part A and part B, of equal size:

Part A = $(X_j^k)_{k=0,1;\ 1 \le j \le n_k/2}$ and Part B = $(X_j^k)_{k=0,1;\ n_k/2 < j \le n_k}$.

For $k = 0, 1$ let $\hat\mu_k^A$ (resp. $\hat\mu_k^B$) be the empirical mean of the
learning data from part A (respectively from part B) and class $k$. For
$i = 1, \dots, p$, $k = 0, 1$, let $\hat\sigma_k^2[i]$ be the empirical (unbiased)
variance of the $i$-th feature within the learning data from part B,
$(X_j^k[i])_{n_k/2 < j \le n_k}$, and define
$\hat\sigma^2[i] = \frac{1}{n - 2}\left( (n_0 - 1)\hat\sigma_0^2[i] + (n_1 - 1)\hat\sigma_1^2[i] \right)$.
Now, let us define

sio = — = fif-fiQ, a = ((5-[z])i=i...,p, and Fio = ("t-io W/o"^ 

(3) 
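The pooled per-feature variance used above can be sketched as follows. The normalization by $n_0 + n_1 - 2$ is the usual pooled-variance convention and is an assumption here; the function name is hypothetical, and the function simply pools whatever two samples it is given (in the text, the part B observations of each class).

```python
from statistics import variance
from typing import List, Sequence


def pooled_variance(
    sample0: Sequence[Sequence[float]], sample1: Sequence[Sequence[float]]
) -> List[float]:
    """Per-feature pooled unbiased variance of two samples.

    Each sample is a list of observations, each observation a list of p
    features. Every class needs at least two observations.
    """
    n0, n1 = len(sample0), len(sample1)
    p = len(sample0[0])
    pooled = []
    for i in range(p):
        v0 = variance([x[i] for x in sample0])  # unbiased variance, class 0
        v1 = variance([x[i] for x in sample1])  # unbiased variance, class 1
        pooled.append(((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2))
    return pooled


class0 = [[0.0], [2.0]]
class1 = [[1.0], [1.0], [4.0]]
pooled = pooled_variance(class0, class1)
```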

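We recall that in this paper, $n = n_1 + n_0$. We will write

\[ \Gamma_q(R, r) = \left\{ (P_1, P_0) \in \Gamma \ \text{s.t.}\ F_{10} \in l^q(R, P_{\mathrm{Cov}(P_1)}),\ \|F_{10}\|_{L_2(P_C)} \ge r \right\} \qquad (4) \]

where $\Gamma$ is the set of pairs $(P_1, P_0)$ of Gaussian probability
distributions on $\mathbb{R}^p$ with $\mathrm{cov}(P_1) = \mathrm{cov}(P_0)$.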

Definition of procedures. We propose two discrimination procedures. The
first one is simpler and comes with a more complete theoretical result, while
the second one is more sophisticated but requires further theoretical work.
Both use the discrimination procedure $\hat g$ (defined by Equation (2)) with
$\hat s_{10}$ defined by Equation 3. In both cases $\hat F_{10}$ is evaluated
after a dimensionality reduction step that restricts it to a subset $I$ of the
features.

Given that we know $I$, the preceding rule is a rephrasing of the feature
annealed independence rule introduced in [12] in the case when group 0 and
group 1 have equal variance. The proposed methods differ by the procedure used
to construct $I$ (even if the result we give in Theorem 5.1 applies to the case
when $C$ is not diagonal). We propose to use two simple procedures borrowed
from the thresholding estimation literature (Procedures 1 and 2 below) for
selecting the subset $I$ in order to estimate the normal vector $F_{10}$ to the
optimal separating hyperplane. Procedure 3 is the thresholding procedure
proposed in [12].

Procedure 1 : universal dimensionality reduction. In the first procedure $I$ is
given by

mio[i] 



i e {l,...,ri : 



>V2^ (6) 



This can be seen as a thresholding estimation of $C^{-1/2} m_{10}$ with a universal
threshold (see for example [11]). The next procedure relies on the same idea
with a false discovery rate thresholding procedure.
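A universal-threshold selection of this kind can be sketched as below. The sketch assumes that each coordinate of the mean difference is standardized by $\hat\sigma[i]\sqrt{1/n_0 + 1/n_1}$ before being compared with $\sqrt{2\log(p)}$; this standardization and the function name are assumptions of the illustration, not a transcription of Equation (6).

```python
import math
from typing import List, Sequence


def universal_selection(
    m_hat: Sequence[float], sigma_hat: Sequence[float], n0: int, n1: int
) -> List[int]:
    """Indices whose standardized mean difference exceeds sqrt(2 log p).

    m_hat[i] is the estimated mean difference of feature i and sigma_hat[i]
    its estimated standard deviation (assumed strictly positive).
    """
    p = len(m_hat)
    scale = math.sqrt(1.0 / n0 + 1.0 / n1)  # assumed std of a mean difference
    threshold = math.sqrt(2.0 * math.log(p))
    return [
        i for i in range(p)
        if abs(m_hat[i]) / (sigma_hat[i] * scale) > threshold
    ]


selected = universal_selection([3.0, 0.1, -2.5, 0.0], [1.0, 1.0, 1.0, 1.0], n0=8, n1=8)
```

With four features the threshold is about 1.67, so only the two features with standardized differences 6 and -5 are kept.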
Procedure 2 : False discovery rate control dimensionality reduction. In the
second procedure $I$ is given by



fFDR 



= {i(^{l,...,p} 



am 



> A 



FDB. 



(7) 



where $\lambda_{FDR}$ is a data dependent threshold chosen with the Benjamini and
Hochberg procedure [3] for control of the false discovery rate (FDR) of the
following multiple hypotheses:



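\[ \forall i = 1, \dots, p \quad H_{0i} : E\left[ \frac{\hat m_{10}[i]}{\hat\sigma[i]} \right] = 0 \quad \text{versus} \quad H_{1i} : E\left[ \frac{\hat m_{10}[i]}{\hat\sigma[i]} \right] \neq 0 \qquad (8) \]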



This procedure is as follows. Let us define T[i] = . The are ordered 

in decreasing order : 

mm >■■■> \T[ip)]\ and Afo^^ = |T[(fcfo^^)]| 



where k[ff^ = max 



|fcea...,ri:™]|>/i.(M)| 



z{a) is the quantile of order a of a standardized gaussian random variable and 
bp € [0,l/2[ is under bounded by where cq is a positive constant (which 
does not depend on p) . 
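For orientation, here is a sketch of the standard Benjamini-Hochberg step-up rule applied to two-sided Gaussian p-values of test statistics $T[i]$. It is written in terms of p-values rather than quantiles, and the parameter name `level` stands in for the FDR parameter; both choices, like the function name, are assumptions of this illustration.

```python
from statistics import NormalDist
from typing import List, Optional, Sequence, Tuple


def bh_selection(
    t_stats: Sequence[float], level: float
) -> Tuple[List[int], Optional[float]]:
    """Benjamini-Hochberg step-up selection on two-sided Gaussian p-values.

    Returns the selected indices and the data dependent threshold, i.e. the
    smallest selected |T|. Returns ([], None) when nothing is selected.
    """
    p = len(t_stats)
    phi = NormalDist().cdf
    order = sorted(range(p), key=lambda i: -abs(t_stats[i]))  # decreasing |T|
    k_star = 0
    for k, i in enumerate(order, start=1):
        p_value = 2.0 * (1.0 - phi(abs(t_stats[i])))
        if p_value <= level * k / p:
            k_star = k  # keep the largest k passing the step-up test
    if k_star == 0:
        return [], None
    return sorted(order[:k_star]), abs(t_stats[order[k_star - 1]])


selected, threshold = bh_selection([5.0, 0.3, -4.0, 0.1, 1.0], level=0.1)
```

Unlike the universal threshold, the cut-off adapts to the data: the more large statistics there are, the lower the threshold can go.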

The procedure can also be seen as a thresholding estimation of $C^{-1/2} m_{10}$,
but with an FDR threshold (see [1]). There are many thresholding procedures in
the literature today and others could be used. The universal threshold was the
first to appear and is the simplest. The FDR threshold is one of the most
efficient and adaptive. In addition, in our problem, it can lead to an
interesting statistical rephrasing of the procedure. Indeed, the multiple
hypotheses given by Equation (8) are connected heuristically with, for all
$i = 1, \dots, p$,

$H_{0i}$ : the ratio of between-class variance to within-class variance is null in direction $i$

versus

$H_{1i}$ : the ratio of between-class variance to within-class variance is not null in direction $i$.

Hence our procedure can be rephrased in two steps:

1. Make a "vertical analysis of the variance" to select the directions $i \in I$ in
which the data are well separated (i.e. $(C^{-1/2} m_{10})[i]$ is large).

2. Perform a standard discriminant analysis in the space spanned by the
directions chosen in step 1.
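The second step can be sketched as follows for a diagonal covariance: the normal vector is set to $\hat m_{10}[i] / \hat\sigma^2[i]$ on the selected directions and to zero elsewhere, and the linear rule is then applied. This coordinate-wise form and the function names are assumptions of the illustration.

```python
from typing import Iterable, List, Sequence


def restricted_normal_vector(
    m_hat: Sequence[float], var_hat: Sequence[float], selected: Iterable[int]
) -> List[float]:
    """Normal vector supported on the selected directions (diagonal covariance)."""
    keep = set(selected)
    return [
        m_hat[i] / var_hat[i] if i in keep else 0.0
        for i in range(len(m_hat))
    ]


def classify(x: Sequence[float], f_hat: Sequence[float], s_hat: Sequence[float]) -> int:
    """Return 1 if <f_hat, x - s_hat> > 0 and 0 otherwise."""
    score = sum(f * (xi - si) for f, xi, si in zip(f_hat, x, s_hat))
    return 1 if score > 0 else 0


f_hat = restricted_normal_vector([2.0, 0.1, -1.0], [1.0, 1.0, 0.5], selected=[0, 2])
s_hat = [1.0, 0.0, -0.5]
label_a = classify([2.0, 5.0, -1.0], f_hat, s_hat)
label_b = classify([0.0, 5.0, 0.0], f_hat, s_hat)
```

The unselected second coordinate has no influence on the decision, however large its value.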

Procedure 3 : threshold choice from [12]. In Procedure 3, $I^{FAIR}$ is computed
the same way as $I^{FDR}$, replacing $k_{FDR}$ by:

1 ri[J2T=iT^t)+mil/ni-l/no) 



„FAIR 



Argmaxm=i,...,p ' 



maxj<m a-2[(i)] mnino + nino X]" i 



(i) 



(9) 
Procedure 4 : Higher Criticism from [8]. In Procedure 4, $I^{HC}$ is computed with
the Higher Criticism procedure [8], with $\lambda_{HC} = |T[(k_{HC})]|$

where k"^ = Argmaxi<k<pq \ ~ '^t^^^ l I ^ (lo) 

[ y/k{p-k) J 

with, for all $k = 1, \dots, p$, $\pi[(k)] = 2(1 - \Phi(|T[(k)]|))$ and $\Phi(x) = P(\mathcal{N}(0, 1) \le x)$.
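A higher criticism index can be sketched as below, using the two-sided p-values $\pi[(k)]$ defined above. The objective used here, $\sqrt{p}\,(k/p - \pi[(k)]) / \sqrt{(k/p)(1 - k/p)}$, is one common normalization and is an assumption of this illustration, as are the `fraction` parameter restricting the search range and the function name.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def higher_criticism_index(
    t_stats: Sequence[float], fraction: float = 0.5
) -> Tuple[int, float]:
    """Return (k_hc, threshold) maximizing a higher criticism objective.

    The search runs over 1 <= k <= fraction * p (and k < p). The returned
    threshold is the k_hc-th largest |T|. Requires at least two statistics.
    """
    p = len(t_stats)
    phi = NormalDist().cdf
    abs_sorted = sorted((abs(t) for t in t_stats), reverse=True)
    k_max = min(p - 1, max(1, int(fraction * p)))
    best_k, best_value = 1, -math.inf
    for k in range(1, k_max + 1):
        pi_k = 2.0 * (1.0 - phi(abs_sorted[k - 1]))  # two-sided p-value
        hc = math.sqrt(p) * (k / p - pi_k) / math.sqrt((k / p) * (1.0 - k / p))
        if hc > best_value:
            best_k, best_value = k, hc
    return best_k, abs_sorted[best_k - 1]


stats = [6.0, 5.5, 0.2, 0.1, -0.3, 0.4, 0.05, -0.15, 0.25, 0.35]
k_hc, lambda_hc = higher_criticism_index(stats, fraction=0.5)
```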
Theoretical result and comments 

Theorem 3.1. Let $\hat g$ be defined by Equation (2) with $\hat F_{10}$ as given by
Equation 5.

1. Suppose we are using $I$ as defined by Equation 6. Assume there exist
$r, R > 0$ such that $0 < r \le \|F_{10}\|_{L_2(P_C)} \le R$ and that
$\log(p) \ll \sqrt{n}$. Then, there exists $c(R) > 0$ such that



r 



(l+7e(C-i/V,io,n)) +o(^- 



(11) 



where 

2. Suppose now that we are using $I^{FDR}$ as defined by Equation 7. Suppose
that in Equation 8 and in the definition of $\lambda_{FDR}$ below this equation, $\hat\sigma[i]$
equals C[i,i]. Define rjp = p~pRy^n{p). If G [ '"s^^^) ,P^^] for S > 0, 
then, for all < q < 2 we have 

W>0, sup ^R^ACi9)-n<^-^('^^jf$^^ 

(Po,Pi)ef2,(_R,e,r) 2r ^ 2Rn^/^{p) 

' (12) 

c(6p) = l + ^^ + 0p(l), 

where $b_p$ is the real value used for the choice of $k_{FDR}$, and $P^{\otimes n}$ is the law
of the learning set.

Comments about point 1 and general comments. The bound given by Equation 11
can lead to a rate of convergence if one knows a suitable bound for
$\mathcal{R}(C^{-1/2} m_{10}, n)$. These types of bounds are well known (see for
example Lemma 6.1 in [6]) and we will not comment further. As an example, when
the number of non-null components of $C^{-1/2} m_{10}$ is bounded by $S$, we have

\[ \mathcal{R}(C^{-1/2} m_{10}, n) \le S, \]

which implies that

\[ E_{P^{\otimes n}}[\mathcal{C}(\hat g) - \mathcal{C}^*] = O\left( \frac{\log(p)}{n} \right). \]
The assumption $\|F_{10}\|_{L_2(P_C)} \le R$ could be relaxed with additional
technicalities in the proofs. In any case, it is easy to understand that large
values of $\|F_{10}\|_{L_2(P_C)}$ correspond to the case where the data are well
separated, which is not of great interest. In addition, this assumption is often
needed implicitly when one wants to bound $\mathcal{R}(C^{-1/2} m_{10}, n)$.

The assumption $\log(p) \ll \sqrt{n}$ can be seen as a rather strong assumption
for very large n. It is needed to show that the use of $\hat\sigma[i]$ in (6)
gives almost the same result as the one we would have by taking $\sigma[i]$
instead.

Note that, for certain values of $q \in\, ]0, 2[$, the rate of convergence can be
fast (i.e. faster than $n^{-1/2}$) under the condition that $C^{-1/2} m_{10}$
belongs to an $l^q$ ball. On the other hand, assuming that $r > 0$ cannot tend to
zero can be seen as a margin assumption, since

\[ \|F_{10}\|_{L_2(P_C)} \ge r > 0 \implies \exists C > 0 : \forall \epsilon > 0 \quad P(|1 - 2\eta(X)| \le \epsilon) \le C\epsilon, \]

where $\eta(X) = E[Y | X]$. Apart from Theorem 5.1 (from which Theorem 3.1 can be
derived), the theoretical novelty of this paper is to give upper bounds on the
excess risk for procedures involving a particular dimensionality reduction
(Procedures 1 and 2 for the choice of $I$). In Bickel and Levina [4] no
thresholding procedure is proposed, and in Fan and Fan [12] the choice of the
threshold is introduced after the main theoretical result to mimic the oracle
bound of their Theorem 5. In addition, most results in Fan and Fan [12] are
established in the case where $C = I_d$. Let us recall that if $F$ is a Gaussian
random variable with values in a Hilbert space, then the covariance operator is
necessarily nuclear. Hence the assumption used by the above mentioned authors
does not allow one to consider, as a limiting distribution when $p$ tends to
infinity, Gaussian measures with support in a Hilbert space.

Finally, even if Theorem 3.1 does not treat the case where $C$ is not diagonal,
Theorem 5.1 gives hints in that direction, and extending our work with ideas
from Bickel and Levina [4] will be the purpose of a further study.

Comments about point 2. One can use the inequality (obtained at point 4 of
the comments of Theorem 5.1)

\[ E[\mathcal{C}(\hat g) - \mathcal{C}^*] \le c \left( E\left[ \|F_{10} - \hat F_{10}\|^2_{L_2(P_C)} \right] \right)^{1/2} \]

to handle the case where $\|F_{10}\|_{L_2(P_C)}$ can tend to zero when $p$ tends to
infinity (no margin assumption). The rate of convergence is then no longer the
fast rate.

In point 2, the rate of convergence is faster when $q$ is close to 0, and slower
when it is close to 2. This leads us to consider the sparsity of
$C^{-1/2}(\mu_0 - \mu_1)$ as a vector of $\mathbb{R}^p$ in a well chosen basis.

The constant $c(b_p)$ does not depend on $q \in\, ]0, 1/2[$. We could obtain the
same speed with a universal threshold ($\lambda_U = \sqrt{2\log(p)}$). In that
case, the constant would be larger (cf. [1]).

In the case of the FDR dimension reduction technique, the assumption about
$\hat\sigma[i]$ is unrealistic. We do not think the result is still true without
this assumption, because the numerical results obtained are rather poor.
Avoiding this assumption with a slight change of the procedure could be done in
further work, in relation with the work in [1].

3.1. Numerical Results 

We present here numerical results obtained with the presented procedures. We
evaluate the error rate of seven procedures using Equation (2):

1. $g^{Fisher}$, the procedure obtained by taking $\hat F_{10} = \hat C^{-}(\hat\mu_1 - \hat\mu_0)$,
where $\hat C$ is the diagonal matrix with $\hat C[i, i]$ the empirical variance of
$(X_j^k[i])_{j=1,\dots,n_k;\ k=0,1}$.

2. $g^{U}$, the procedure obtained by taking $\hat F_{10}$ as given by Equation 5 and
$I$ as defined by Equation 6.

3. $g^{FDR}$, the procedure obtained by taking $\hat F_{10}$ as given by Equation 5 and
$I^{FDR}$ defined by Equation 7 with $q = \gamma/\log(p)$ and $\gamma$ chosen by 10-fold
cross validation over an exponential grid of $\{10^{0}, 10^{-1}, \dots, 10^{-10}\}$.

4. $g^{FAIR}$, the procedure obtained by taking $\hat F_{10}$ as given by Equation 5 and
$I^{FAIR}$ defined by Equation 9.

5. $g^{Std}$, the procedure obtained by taking $\hat F_{10}$ as given by Equation 5 and
$I^{FDR}$ as defined by Equation 7, replacing the Gaussian quantiles with the
appropriate Student quantile function, with $q = \gamma/\log(p)$ and $\gamma$ chosen by
10-fold cross validation over an exponential grid of $\{10^{0}, 10^{-1}, \dots, 10^{-10}\}$.

6. $g^{SC}$ is the nearest shrunken centroid classification procedure as defined in [20].
We used the corresponding R implementation in package pamr.

7. $g^{HC}$, the procedure obtained by taking $\hat F_{10}$ as given by Equation 5 and
$I^{HC}$ as defined by Equation 10, with $q$ chosen by 10-fold cross validation
over a grid of $q = \{0.2, 0.1, 0.05, 0.01\}$.

We made two different simulations for the numerical experiments: 

• Simulation 1 

/io = 0, = 311,-4, and C ~ diag{array{l,p)). 

• Simulation 2 

$\mu_0 = 0$, $\mu_1[1:4] = [0.01, 0.5, 0.02, 0.5]/3$, $\mu_1[5:p] = 0$,
and $C$ = diag(array(c(0.01, 2), p)).

where the definition of $C$ is given in the R language.
All the results shown in the following tables have been obtained by repeating the
experiment 100 times and averaging the error rates (which are given in %). The
corresponding R code is available³. An R package will be implemented in the
future, including more plug-in type high dimensional classification procedures.
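As an illustration of the second design, the sketch below draws two Gaussian samples with a diagonal covariance whose variances alternate between 0.01 and 2, which is what the R expression diag(array(c(0.01, 2), p)) produces by recycling the pair. It is a Python stand-in, not the author's R code; the default `mean_scale` of 1/3 follows the "/3" read in the design above and should be treated as an assumption, as should the function name.

```python
import math
import random
from typing import List, Tuple


def simulation_2(
    n0: int, n1: int, p: int, mean_scale: float = 1.0 / 3.0, seed: int = 0
) -> Tuple[List[List[float]], List[List[float]], List[float]]:
    """Draw class 0 and class 1 samples for a Simulation 2 style design.

    Returns (sample0, sample1, variances); each sample is a list of
    observations with p independent Gaussian features.
    """
    if p < 4:
        raise ValueError("p must be at least 4")
    rng = random.Random(seed)
    # Variances alternate 0.01, 2, 0.01, 2, ... along the diagonal.
    variances = [0.01 if i % 2 == 0 else 2.0 for i in range(p)]
    mu0 = [0.0] * p
    mu1 = [v * mean_scale for v in (0.01, 0.5, 0.02, 0.5)] + [0.0] * (p - 4)

    def draw(n: int, mean: List[float]) -> List[List[float]]:
        return [
            [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, variances)]
            for _ in range(n)
        ]

    return draw(n0, mu0), draw(n1, mu1), variances


sample0, sample1, variances = simulation_2(n0=5, n1=7, p=10)
```

Only the first four features carry signal, two of them with small variance and two with large variance, which is what makes the design hard.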

³ robin.girard@mines-paristech.fr
Simulation 1 The results from the first experiment confirm the poor performance
of the Fisher rule with respect to the other rules (which are all based on a
dimensionality reduction procedure). Procedures using cross validation for
tuning the thresholding parameter perform best. Note that the standard
deviation ranges from 2 to 5 (in the case when n = 50 or 100) or even 7 to 8
(for n = 20).









n = 20
   p     g^U     g^Fisher   g^FDR   g^SC    g^Std   g^FAIR
   100   13.90   17.27      7.95    8.85    8.15    13.75
   500   23.97   29.82      8.47    8.75    8.57    24.85

n = 50
   p     g^U     g^Fisher   g^FDR   g^SC    g^Std   g^FAIR
   100   8.75    10.39      6.79    7.08    6.77    8.19
   500   14.1    19.87      7.07    7.47    7.10    18.48

n = 100
   p     g^U     g^Fisher   g^FDR   g^SC    g^Std   g^FAIR
   100   7.91    8.89       6.93    7.05    6.92    7.55
   500   10.67   15.31      7.04    7.05    7.07    10.44

Table 1: Results obtained for n = 20, 50, 100, p = 100, 500 with
Simulation 1.



Simulation 2 In the second simulation the signal is really hard to distinguish
and there are interesting features with small and with large variance. The
results show the importance of using cross validation. We also see that the FAIR
rule (which does not use cross validation) performs better than the universal
thresholding rule, especially for moderate dimension (see n = 50, p = 100 or 500).





n = 10
   p      g^U     g^Fisher   g^FDR   g^SC    g^Std   g^FAIR   g^HC
   100    35.6    37.4       21.85   20.70   22.35   31.25    22.65
   500    42.5    44.05      30.6    25.95   28.85   38.6     30.95
   5000   47.05   46.9       39.2    33.75   39.45   45.75    40.00

n = 20
   p      g^U     g^Fisher   g^FDR   g^SC    g^Std   g^FAIR   g^HC
   100    30.22   32.32      15.17   15.37   15.32   23.27    17.80
   500    38.6    40.57      17.02   16.05   16.27   34.25    20.30
   5000   47.25   47.95      22.77   19.4    22.77   45.00    30.45

n = 50
   p      g^U     g^Fisher   g^FDR   g^SC    g^Std   g^FAIR   g^HC
   100    22.42   24.95      12.5    12.91   12.51   17.15    16.48
   500    34.14   36.21      12.92   13.03   12.54   28.66    16.85
   5000   43.69   45.31      12.36   12.82   12.47   42.18    20.04

Table 2: Results obtained for n = 20, 50, 100, p = 100, 500, 5000
with Simulation 2.



4. Conclusions 

We have studied the problem of discrimination in a Gaussian framework of high
dimension. We have shown, with finite sample lower bounds, that standard
procedures fail in high dimension ($p \gg n$), and have proposed procedures to
resolve this problem. These procedures are based on a dimensionality reduction
technique. They can also be interpreted as thresholding estimators of the normal
vector $F_{10}$ to the optimal separating hyperplane
$\{x \in \mathbb{R}^p : \langle F_{10}, x - s_{10} \rangle_{\mathbb{R}^p} = 0\}$.
We have given upper bounds on the excess risk associated with these procedures
that exhibit a fast rate of convergence under a sparsity assumption. These upper
bounds have been derived from a general theorem (Theorem 5.1) which may be of
interest in its own right for those who wish to prove convergence of other
procedures in the framework of linear discriminant analysis. We have provided
numerical results that confirm the theoretical development of the paper. The
case when $P_0$ and $P_1$ are Gaussian with different covariances can be treated
with similar ideas (see the author's work [13], but no satisfactory theoretical
results exist in this case) and will be investigated in further work. The case
when the covariance matrix $C$ is not diagonal will also be the purpose of a
further investigation. Future work will discuss an evaluation of the robustness
of the procedure with respect to non-Gaussian data, numerically and
theoretically.

5. Proofs 

5.1. Fundamental Theorem 

Theorem 5.1. Suppose g is given by 2 with sio = sio- Let us define 

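\[ d_0 = \frac{1}{\|\hat F_{10}\|_{L_2(P_C)}} \langle \hat F_{10}, \hat s_{10} - s_{10} \rangle_{\mathbb{R}^p}. \qquad (13) \]

Then if $\alpha = \alpha_{L_2(P_C)}(\hat F_{10}, F_{10})$, we have: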

1 ( l-cos(a)\ ii^io"i2(Pc) 

-P < Af{0, 1) < ||i^io|U.(Po) ^ j e ^ < C{g) - C* (14) 

and 

\[ \mathcal{C}(\hat g) - \mathcal{C}^* \le c\, P\left( |\mathcal{N}(0, 1)| \le (1 - \cos\alpha) \|F_{10}\|_{L_2(P_C)} \right) + c\, d_0^2 \qquad (15) \]
for a universal constant $c > 0$.
Comments 
1. These bounds give the relation between $(\alpha, \|F_{10}\|_{L_2(P_C)}, |d_0|)$ and
the excess risk. $|d_0|$ is the error term related to the estimation of $s_{10}$
and $\alpha$ is the error term related to the estimation of $F_{10}$.

2. When $\hat s_{10} = s_{10}$ (i.e. $d_0 = 0$) and $\|F_{10}\|_{L_2(P_C)}$ is fixed and
positive, it is necessary to have $\alpha$ tending to zero in order to have an
excess risk tending to zero. Moreover, we see that, in this case, there exist
$0 < C_1 < C_2$ such that

\[ C_1 \alpha^2 \le \mathcal{C}(\hat g) - \mathcal{C}^* \le C_2 \alpha^2. \]

3. Recall that $d = \|F_{10}\|_{L_2(P_C)}/2$ can be seen as a theoretical measure of
the separation between $P_1$ and $P_0$ (note that the Hellinger distance can also
be expressed as a function of $d$). Large values of $d$ are associated with well
separated data and small values of $d$ with non-separated data. In addition,
Inequality 14 can be used as a contribution to the problem of finding
necessary conditions for the separation (by a classification rule) of Gaussian
mixtures (as treated in [7]).

4. If is the orthogonal projection operator in L2{Pc) one can see that : 

II^1oI1l2(Pc)(1 - cos(a)) = ||i^lollL2(Pc) ~ l|nFio-P'lollL2(Pc) 



< min<^ \\^f,\Fw\\l2{Pc)^ 



l%-"o-^iolli2(Pc) 
2|1-F'io||l2(Pc) 



and in particular 



llFio||L2(Pc.)(l-cos(a)) < --^""V (16) 

'^\\-fw\\L2(Pc) 

When do = 0, the upper bound in this last equation is sharper than the 
upper bound we have by the following standard sequence of inequalities 

E[C((?) - C*] = E[|2?/(X) - llUg.^g] ( with 77(X) = nX\Y]) 

= E[|V;(e^-)|V#9] 

\ — X dPi 
( with i^{x) = ——) and £10 = log( — )) 
1 + X dPo 

< E[|/:io|]l ■ ,r ^^ ■ ir ^ 
— LI iui sigri(£io)#,sjgn(£io)J 

( with £10 ^ ^(Ao,Sio - x)^p) 



<E[|£io-£lo|] 

< c (e[||Fio - Ao|li2(Pc)])'^' ( " > 

which, if $\|F_{10}\|_{L_2(P_C)}$ remains bounded from below (this can be seen as a
margin assumption), is the square root of what can be derived from (16).
It is also sharper than the bound given at the end of Section 2 in [4].
5.2. Proof of Theorem 3.1

Proof. We separate the proof into 3 steps.

Step 1 First, with Theorem 5.1, Equation 15 and Equation 16, we have:

""lO 1 1 La (Pc) 



Ep»„ [C{g)-C*] < cEp»„ 
On the one hand, 



\Fio — Fu 



W\\L2{Pc) 



Mo I 



E 



A [\do\' 



E 



10 



SlO ^ SlO 



10||L2(Pc) 



< 



for a given constant c'. On the other hand, by construction: 

^=|lAo - FiollL(P^) ll(^M^ioWl,e/).=i P - (f^Mi^ioW)z=i,...,p||Rp, 

and 



2 a^[i\ ( ■mia[i\ 



'iei 



< 



(7[zl 



'lOL 



1 



Using standard inequalities on the convergence of $\hat\sigma^2[i]$ to $\sigma^2[i]$, one can show,
summing up over $i \in \{1, \dots, p\}$, that there exists a constant $c > 0$ such that



E[A] < c E 



cr z 



i— 1, . . .,p 



+ \\Fio\\l2(Pc): 



(17) 

Hence, it only remains to bound the expectation on the right side of the
preceding equation, say $E[B]$. In both cases (Step 2 and Step 3) we will use the fact
that the covariance matrix of the vector C~^/^mio equals Ip-^- 

Step 2 : the case of the universal procedure 

In the case of the universal procedure. 



m-ioW 



iei 



where Y = C~^^'^mio 



i— 1, . . .,p 



VVar(y[i]) 



l>^V2T^ 



i— 1, . . . ,p 
Following the notation of Theorem 4 (with n replaced by p) from Donoho
and Johnstone [9], we set



$\gamma$ a positive constant, $\epsilon_p$ a positive sequence decreasing to zero, and define three
different events:

E^{{l-j) loglog(p) < ll - 21og(p) < cep log(p)} 

= {(1 - 7)loglog(p) > ll - 21og(p)} E+ = {ll - 21og(p) > ceplog(p)} . 
From the Bayes formula we get:

E[B] < E[B\E] + {P{E+) + P{E_))E[B]. 

We also have 

E[B] < E[\\YWU + ||Fio||i,(p,) = 2||Fio||L,(Pe) + 2p 

Concentration inequalities for a chi-square random variable $U$ with $n - 1$ degrees
of freedom (see for example the comments on Lemma 1 in [16]) give (for $n > 4$)

/ 71 — 1 

PiE+) <P{U-in-l)> -—ce, 



<P[U -{n-\)> V^^^^I^cep + ^^^-^cej 
log(p) 

P71 



< e "4 ^"p o 



(because $\log(p) \ll \sqrt{n}$) and



P(i?-) < P ( [/ - (n - 1) < (l^lKli^e- 



2 log log(p) 



f log(p) 

< g 2 log log(p) = ^ I 



This ends the proof. 
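As a numerical illustration of the chi-square concentration step, the following Python sketch compares empirical tail frequencies with exponential bounds of the form $P(U - D \ge 2\sqrt{Dx} + 2x) \le e^{-x}$ and $P(U - D \le -2\sqrt{Dx}) \le e^{-x}$ for a chi-square variable $U$ with $D$ degrees of freedom. This specific form is an assumption (a commonly used statement of such inequalities, not necessarily the exact one in [16]); the function names, the degrees of freedom and the sample sizes are illustrative only.

```python
import math
import random


def chi_square_sample(df, rng):
    """Draw one chi-square variate as a sum of df squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


def empirical_tails(df, x, trials=5000, seed=0):
    """Return (upper frequency, lower frequency, exp(-x)).

    Assumed bounds being illustrated (not taken from the paper):
      P(U - df >= 2*sqrt(df*x) + 2*x) <= exp(-x)
      P(U - df <= -2*sqrt(df*x))      <= exp(-x)
    """
    rng = random.Random(seed)
    upper_threshold = df + 2.0 * math.sqrt(df * x) + 2.0 * x
    lower_threshold = df - 2.0 * math.sqrt(df * x)
    upper_hits = 0
    lower_hits = 0
    for _ in range(trials):
        u = chi_square_sample(df, rng)
        upper_hits += u >= upper_threshold
        lower_hits += u <= lower_threshold
    return upper_hits / trials, lower_hits / trials, math.exp(-x)


if __name__ == "__main__":
    up, low, bound = empirical_tails(df=20, x=1.0)
    print(f"upper tail {up:.4f}, lower tail {low:.4f}, bound {bound:.4f}")
```

Both empirical frequencies should fall below the exponential bound; the simulation only illustrates the shape of the inequality and does not replace the argument of the proof.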

Step 3 : the case of the FDR procedure 

Theorem 1.1 of Abramovich et al. [1] and Theorem 5 (point 36) of Donoho
and Johnstone [10] then lead to the desired result.

□ 
5.3. Proof of Theorem 5.1

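In this proof, we will use the following subsets of $\mathbb{R}^p$:

$$\hat V = \{x \in \mathbb{R}^p : \langle \hat F_{10}, x - \hat s_{10}\rangle_{\mathbb{R}^p} \ge 0\}, \qquad V = \{x \in \mathbb{R}^p : \langle F_{10}, x - s_{10}\rangle_{\mathbb{R}^p} \ge 0\},$$

$$\hat V_2 = \{x \in \mathbb{R}^p : \langle C^{1/2}\hat F_{10}, x\rangle_{\mathbb{R}^p} \ge d_0\}, \qquad V_2 = \{x \in \mathbb{R}^p : \langle C^{1/2}F_{10}, x\rangle_{\mathbb{R}^p} \ge 0\},$$

where $d_0$ is defined by Equation 13. The proof is divided into four steps: in
the first one we make a change of geometry, and in the second one we obtain a
simple expression in terms of the Gaussian measure of subsets of $\mathbb{R}^2$. In the third one we
derive the lower bound, and in the fourth one the upper bound.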

Step 1. We have 

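$$C(\hat g) - C(g^*) = \frac12\Big(P_0(\hat V \setminus V) - P_0(V \setminus \hat V) + P_1(V \setminus \hat V) - P_1(\hat V \setminus V)\Big)$$

$$= \frac12\Big(P_{10}(\hat V \setminus V - m_{10}) - P_{10}(V \setminus \hat V - m_{10}) + P_{10}(V \setminus \hat V + m_{10}) - P_{10}(\hat V \setminus V + m_{10})\Big),$$

where $P_{10}$ is the Gaussian probability distribution with covariance $C$ and mean
$s_{10}$, and $m_{10} = \frac{\mu_1 - \mu_0}{2}$. Changing the geometry now gives

$$C(\hat g) - C(g^*) = \frac12\Big(P(\xi - C^{1/2}F_{10}/2 \in \hat V_2 \setminus V_2) - P(\xi - C^{1/2}F_{10}/2 \in V_2 \setminus \hat V_2)$$

$$\qquad + P(\xi + C^{1/2}F_{10}/2 \in V_2 \setminus \hat V_2) - P(\xi + C^{1/2}F_{10}/2 \in \hat V_2 \setminus V_2)\Big),$$

where $\xi$ is a Gaussian random variable on $\mathbb{R}^p$ with mean $0$ and covariance $I_p$.
Notice that if $\alpha = \alpha_{L_2(P_c)}(\hat F_{10}, F_{10})$, then $(C(\hat g) - C(g^*))(\alpha) = (C(\hat g) - C(g^*))(-\alpha)$;
hence, we will suppose without loss of generality that $\alpha \ge 0$ in the rest of the proof.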

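Step 2. This step is roughly a geometric exercise in $\mathbb{R}^2$ (more precisely, in the
span of $C^{1/2}\hat F_{10}$ and $C^{1/2}F_{10}$ in $\mathbb{R}^p$, or the span of $\langle \hat F_{10}, \cdot\rangle_{\mathbb{R}^p}$ and $\langle F_{10}, \cdot\rangle_{\mathbb{R}^p}$ in
$L_2(P_c)$). First, it is easy to see (with the result of Step 1) that, by a symmetry
argument, we have

$$C(\hat g) - C(g^*) = \frac12\Big(P(\mathcal N(0, I_2) \in G_+) - P(\mathcal N(0, I_2) \in G_-)\Big) \qquad (18)$$

where $G_+$ and $G_-$ are subsets of $\mathbb{R}^2$ defined by Figure 1 with $d$ and $l$ given by: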

d=^i^%^and^- 



sin(a)||Fio||L2(Pc) 
(note that obtaining $l$ requires a short calculation in the geometry of $\mathbb{R}^2$).







Figure 1. Definition of $G_+$ and $G_-$.

Figure 2. Left: definition of $H_+$ and $H_-$. Right: definition of $a$ and $b$.
Step 3 : The lower bound. For the lower bound, we shall first notice that
C{9)-C{g*) >Up (m,l2) +( i)eG+]-P (mJ2) +( [)eG^ 



2 \ \ ^ ' " \ J +y V ' \ 

and by symmetry, this gives

$$C(\hat g) - C(g^*) \ge P(\mathcal N(0, I_2) \in H_+) - P(\mathcal N(0, I_2) \in H_-) \qquad (19)$$

where $H_+$ and $H_-$ are defined by Figure 2. Let $B$ be the orthogonal
projection of $O$ onto the bisector of $\alpha$ in the figure defining $H_+$ and $H_-$ (i.e.
Figure 2 on the left). Let us define $H = H_+ \setminus S_B(H_-)$ (see Figure 2 on the right),
where $S_B$ is the symmetry with center $B$ (also the symmetry with axis $(O, B)$). One
can see that with this construction and the preceding equation, we have:

$$C(\hat g) - C(g^*) \ge P(\mathcal N(0, I_2) \in H).$$

From this inequality and standard inequalities on Gaussian measures, we get

P(AA(0,1) G [0,6]) _(fL±io^ 



C{g)-C{g*)> 



-e 



2 

where $a$ and $b$ are defined by Figure 2 on the right and can be calculated easily:

, A\Fw\\l2{Pc) I1-F'ioI1l2(Pc) 

o=(l-cosa — - a — cosa ^ — -. 

^ ' 2 2 

This gives the announced lower bound. 

Step 4 : The upper bound. First, we notice that 

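$$C(\hat g) - C(g^*) = C(\hat g) - C(\tilde g) + C(\tilde g) - C(g^*)$$

where

$$\tilde g(x) = \begin{cases} 1 & \text{if } \langle \hat F_{10}, x - s_{10}\rangle_{\mathbb{R}^p} \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$

With Step 2 (setting $d_0 = 0$), we have

$$C(\tilde g) - C(g^*) \le P(\mathcal N(0, I_2) \in H_+) - P(\mathcal N(0, I_2) \in H_-) = P(\mathcal N(0, I_2) \in H).$$

From this equality and standard inequalities on Gaussian measures, we get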

C(5) - C{g*) < P(AA(0, 1) e [0, 26])e-^. 

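It now remains to bound $C(\hat g) - C(\tilde g)$. Following the same type of
calculation as in Step 1, we have the following equality:

$$C(\hat g) - C(\tilde g) = P(\xi \in [0; \epsilon]) - P(\xi \in [-\epsilon; 0])$$

with

$$\xi \sim \mathcal N(m, \sigma^2), \quad \epsilon = \sigma|d_0|, \quad \sigma = \|\hat F_{10}\|_{L_2(P_c)} \quad \text{and} \quad m = \langle \hat F_{10}, F_{10}\rangle_{L_2(P_c)}.$$

Hence, the desired bound follows directly from the following lemma.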
Lemma 5.1. If $m \in \mathbb{R}$, $\sigma > 0$, $\xi \sim \mathcal N(m, \sigma^2)$ and $\epsilon > 0$, then there exists $c > 0$
such that

$$P(\xi \in [0; \epsilon]) - P(\xi \in [-\epsilon; 0]) \le c\,\frac{\epsilon^2}{\sigma^2}.$$

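Proof. Let us call $R$ the left side of the inequality to be proved and set $\Phi(x) =
P(\mathcal N(0,1) \le x)$. We have

$$R = \Phi\Big(\frac{\epsilon}{\sigma} - \frac{m}{\sigma}\Big) - 2\,\Phi\Big(-\frac{m}{\sigma}\Big) + \Phi\Big(-\frac{\epsilon}{\sigma} - \frac{m}{\sigma}\Big),$$

which gives the desired result by Taylor expansion, since there exists $C > 0$
such that $|\Phi''| \le C$. □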
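The Taylor argument can be checked numerically. The following is a minimal Python sketch, assuming the bound of the lemma takes the form $c\,\epsilon^2/\sigma^2$ and using the explicit constant $c = \sup|\Phi''| = e^{-1/2}/\sqrt{2\pi}$. This constant is a choice made here for illustration, not one stated in the paper, and the function names are hypothetical. It relies on the second-difference identity $\Phi(x+h) - 2\Phi(x) + \Phi(x-h) = h^2\Phi''(\theta)$ for some $\theta$ between $x-h$ and $x+h$, applied with $x = -m/\sigma$ and $h = \epsilon/\sigma$.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lemma_lhs(m, sigma, eps):
    """P(xi in [0, eps]) - P(xi in [-eps, 0]) for xi ~ N(m, sigma^2)."""
    upper = std_normal_cdf((eps - m) / sigma) - std_normal_cdf(-m / sigma)
    lower = std_normal_cdf(-m / sigma) - std_normal_cdf((-eps - m) / sigma)
    return upper - lower


# sup over x of |Phi''(x)| = |x| * phi(x), attained at |x| = 1 (illustrative constant).
SUP_PHI_SECOND = math.exp(-0.5) / math.sqrt(2.0 * math.pi)


def lemma_rhs(sigma, eps):
    """Assumed form of the bound: c * eps^2 / sigma^2 with c = sup |Phi''|."""
    return SUP_PHI_SECOND * (eps / sigma) ** 2


if __name__ == "__main__":
    for m in (-1.0, 0.5, 2.0):
        print(m, lemma_lhs(m, 1.0, 0.1), lemma_rhs(1.0, 0.1))
```

For $m = 0$ the two probabilities coincide by symmetry, so the left side vanishes; for $m \neq 0$ the printed left side should stay below the printed bound.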



5.4. Proof of Proposition 2.1 point 1 

Proof. The proof is based on ideas from Bickel and Levina [4] used in their
Theorem 1: if $C$ is the identity, there exist $\xi_1, \dots, \xi_p$, $\mathbb{R}^p$-valued random
variables forming an orthonormal basis of $\mathbb{R}^p$, and a random vector $(\lambda_1, \dots, \lambda_n)$ of
$\mathbb{R}^n$, with the following properties.

1. The $\lambda_i$ are independent of each other, independent of the $\xi_j$,
and $n\lambda_i$ follows a $\chi^2$ distribution with $n - 1$ degrees of freedom.

2. For every $j$, $\xi_j$ is drawn independently and uniformly on the
intersection of the unit sphere of $\mathbb{R}^p$ and the orthogonal complement of $\xi_1, \dots, \xi_{j-1}$.

3. The empirical estimator $\hat C$ of $C$ satisfies:



1=1 



where, if $x, y \in \mathbb{R}^p$, $x \otimes y$ is the linear operator of $\mathbb{R}^p$ that associates to
$z \in \mathbb{R}^p$ the vector $\langle x, z\rangle_{\mathbb{R}^p}\, y$.

When $C$ is not necessarily equal to $I_p$, we get, almost surely:

n ^1 

(^-1/2(^^-1/2 ^ y ^^^^ ^ ^1/2(^-^1/2 

Then, if we define pi = (C~^/^mio, ^i)^pj we have the following equations 



(Pio, Pio)L.(Pe) - {C-^'^^w. C^'^C-C^'^C-^'^m^^)^. = E T^: (20) 

»=i 

n p 

II AoiiLcp,) - E ft ii^ioiiL(Pc) = E (21) 

1=1 i=l 

For reasons of symmetry (the $\xi_i$ are drawn uniformly on the sphere), we have
for every subset $I_n$ of $\{1, \dots, p\}$ of size $n$:

= IE ^ - -, (22) 
From Equations (20) and (21), if $\alpha = \alpha_{L_2(P_c)}(\hat F_{10}, F_{10})$, we have (Cauchy-Schwarz
inequality):



cos (a) 



ft 



|a|<7r/2 



Hence, with Jensen's inequality and Equation (22), this gives $E[\cos(\alpha)] \le \sqrt{n/p}$.
This and Inequality (14) lead to the desired result.
□
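The symmetry and Jensen steps can be illustrated by simulation. The sketch below assumes that the normalised coordinates of a fixed vector in a uniformly drawn orthonormal basis behave like a point drawn uniformly on the unit sphere of $\mathbb{R}^p$ (simulated here by normalising a standard Gaussian vector), so that any $n$ squared coordinates carry on average a fraction $n/p$ of the squared norm. The function names and the values of $n$ and $p$ are illustrative only.

```python
import math
import random


def energy_fraction(n, p, rng):
    """Fraction of the squared norm carried by the first n of p coordinates
    of a point drawn uniformly on the unit sphere of R^p."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    squares = [v * v for v in g]
    return sum(squares[:n]) / sum(squares)


def simulate(n, p, trials=4000, seed=0):
    """Return (mean fraction, mean square root of fraction, n/p, sqrt(n/p))."""
    rng = random.Random(seed)
    fractions = [energy_fraction(n, p, rng) for _ in range(trials)]
    mean_fraction = sum(fractions) / trials
    mean_root = sum(math.sqrt(f) for f in fractions) / trials
    return mean_fraction, mean_root, n / p, math.sqrt(n / p)


if __name__ == "__main__":
    mean_fraction, mean_root, ratio, root = simulate(n=5, p=50)
    print(f"mean fraction {mean_fraction:.4f} vs n/p = {ratio:.4f}")
    print(f"mean root     {mean_root:.4f} vs sqrt(n/p) = {root:.4f}")
```

The mean fraction should be close to $n/p$, and, by Jensen's inequality, the mean of its square root should not exceed $\sqrt{n/p}$, which is the mechanism behind the bound on $E[\cos(\alpha)]$.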



5.5. Proof of Proposition 2.1 point 2

Proof. As in the preceding proof, we are going to use Inequality (14). Hence
it is sufficient to show the following



E [cos(a)l|a|<^/2 
We note that it suffices to obtain

K-F'lO, ^lo) La (Pc) I 



< 



1 



(y^|lFio|U,(P,) + l) 



E 



|-F'io||l2(Pc)II^io||l2(Pc) 
On the other hand,



< 



1 



(v^I1^^ioI1l2(Pc) + 1)- (23) 



E 



\Fw\\l2(Pc)\\^w\\l2{Pc) 

Ww\\l^(Pc) 



< E 



1^io||l2(Pc) 



\Fio\\l2{Pc) 



-E 



< E 



ioIIl2(Pc) 



-, 1/2 






1 +E 



\{Fio,Fio — Fio)l2{Pc)\ 

\\Fio\\l2{Pc)\\Fw\\l2{Pc) 
-, l/2> 



2(PC) 



l^iolli2(Pc) 



where this last inequality results from Cauchy-Schwarz. Recall that

Fia — Fio H 1^^, 

where $\xi$ is a standard Gaussian random vector of $\mathbb{R}^p$. Hence, we easily obtain,

1/2 



E 



ll^lollLlPc) 



1 



and 



l^iollL(Pc) _ \\V^C'/'F,o\\l 



l^lollL(Pc) 



iiv^ci/2i.io + eiii 
The rest of the proof follows from the following simple fact, which is a consequence
of Cochran's theorem and classical calculations on random variables:
Let $a > 0$, $\beta \in \mathbb{R}^p$, $X$ a Gaussian random vector of $\mathbb{R}^p$ with mean $\beta$ and
covariance $I_p$. Then



. II-'^IIrp 

□ 



1 

< . 

- p-2 
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